There are many roads here in Maryland witha reputation for being dangerous. For example, here in College Park, Route One has something of a reputation for being dangerous.
For this project I wanted to evaluate which roads in Maryland are actually the most dangerous and see if I could get some insights about why this may be the case. I found a dataset on Maryland's opendata of all crashes in Maryland since 2015. I downloaded it locally since opendata.maryland.gov goes down for maintenence not infrequently, but attatched is a link to where I got the dataset: https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
import folium
from datetime import datetime
crashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')
/tmp/ipykernel_78/2716668262.py:10: DtypeWarning: Columns (33,45) have mixed types. Specify dtype option on import or set low_memory=False. crashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')
The dataset was large to a point where my computer was struggling, so since I'm looking to evaluate Route One in context of other roads in Maryland, I only looked at data which had named roads, as well as restricting the years being looked at to only 2021 and 2022. I then took out some road names that weren't appropriate, such as 'ENT TO BUSINESS' or 'NO NAME'. This was to ensure I was only looking at roads that were truly roads and not just some catch all referring to multiple locations. I also removed all crashes without latitude and longitude since I'm looking to map the data.
#crashes = crashes[crashes['COUNTY_DESC'] =='Prince George\'s' ]
crashes = crashes[pd.notnull(crashes['REFERENCE_ROAD_NAME'])]
crashes = crashes[pd.notnull(crashes['LATITUDE'])]
crashes = crashes[pd.notnull(crashes['LONGITUDE'])]
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'ENT TO BUSINESS']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'NO NAME']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'BEGIN BRIDGE']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'CROSSOVER']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'END BRIDGE']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'ROAD END']
crashes = crashes[crashes['REFERENCE_ROAD_NAME'] != 'ENT TO SHOPPING CENTER']
crashes = crashes[~crashes['REFERENCE_ROAD_NAME'].str.contains('EXIT')]
display(crashes)
crashes['datetime'] = crashes['ACC_DATE'].astype(str)+crashes['ACC_TIME'].astype(str)
crashes['datetime'] = crashes['datetime'].apply(lambda x: datetime.strptime(x,"%Y%m%d%H:%M:%S"))
print(len(crashes['REFERENCE_ROAD_NAME'].unique()))
| YEAR | QUARTER | LIGHT_DESC | LIGHT_CODE | COUNTY_DESC | COUNTY_NO | MUNI_DESC | MUNI_CODE | JUNCTION_DESC | JUNCTION_CODE | ... | FEET_MILES_FLAG_DESC | FEET_MILES_FLAG | DISTANCE_DIR_FLAG | REFERENCE_NO | REFERENCE_TYPE_CODE | REFERENCE_SUFFIX | REFERENCE_ROAD_NAME | LATITUDE | LONGITUDE | LOCATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2020 | Q2 | NaN | 6.02 | Baltimore City | 24.0 | NaN | NaN | Non Intersection | 1.0 | ... | Miles | M | N | NaN | NaN | NaN | NORTH AVE | 39.311025 | -76.616429 | POINT (-76.616429453205 39.311024794431) |
| 7 | 2022 | Q2 | NaN | 6.02 | Baltimore | 3.0 | NaN | 0.0 | Intersection Related | 3.0 | ... | Feet | F | S | 0.0 | UU | NaN | QUIET STREAM CT | 39.463704 | -76.608660 | POINT (-76.608659683333 39.46370405) |
| 11 | 2022 | Q2 | Daylight | 1.00 | Cecil | 7.0 | NaN | 0.0 | NaN | 88.0 | ... | Miles | M | N | 273.0 | MD | NaN | TELEGRAPH RD | 39.699010 | -75.814360 | POINT (-75.814360383333 39.6990097) |
| 17 | 2021 | Q4 | Daylight | 1.00 | Baltimore | 3.0 | NaN | NaN | Not Applicable | 0.0 | ... | Feet | F | N | NaN | NaN | NaN | EASTERN BLVD | 39.322488 | -76.456602 | POINT (-76.456602128426 39.32248773136) |
| 24 | 2021 | Q4 | Dark Lights On | 3.00 | Baltimore | 3.0 | NaN | 0.0 | Intersection | 2.0 | ... | Feet | F | N | 3033.0 | CO | NaN | LINDEN AVE | 39.250671 | -76.693456 | POINT (-76.693456366667 39.25067135) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 848799 | 2022 | Q3 | Daylight | 1.00 | Anne Arundel | 2.0 | NaN | 0.0 | Not Applicable | 0.0 | ... | Miles | M | E | 607.0 | CO | NaN | OLD STAGE RD | 39.158164 | -76.628893 | POINT (-76.62889296443 39.158163996637) |
| 848800 | 2022 | Q3 | Daylight | 1.00 | Baltimore | 3.0 | NaN | 0.0 | NaN | 88.0 | ... | Feet | F | N | 1300.0 | CO | NaN | REGESTER AVE | 39.378626 | -76.608568 | POINT (-76.608567616667 39.378625966667) |
| 848801 | 2022 | Q3 | Daylight | 1.00 | Baltimore | 3.0 | NaN | 0.0 | Not Applicable | 0.0 | ... | Feet | F | W | 4095.0 | CO | NaN | EBENEZER RD | 39.387539 | -76.418305 | POINT (-76.418304879322 39.387539278395) |
| 848802 | 2022 | Q3 | Daylight | 1.00 | Anne Arundel | 2.0 | NaN | 0.0 | Non Intersection | 1.0 | ... | Feet | F | N | 12.0 | SR | NaN | OCEANIC DR | 39.019090 | -76.403650 | POINT (-76.40365 39.01909) |
| 848803 | 2022 | Q3 | Dark Lights On | 3.00 | Baltimore | 3.0 | NaN | 0.0 | Non Intersection | 1.0 | ... | Feet | F | E | 2189.0 | CO | NaN | PARMELEE RD | 39.367800 | -76.762110 | POINT (-76.76211 39.3678) |
732391 rows × 55 columns
78893
I started out with some basic data exploration. I got the number of crashes for each road documented. I then narrowed down to the 100 most dangerous roads in Maryland as defined by number of crashes. By this metric, Route 1 is the tenth most dangerous road in Maryland from 2015 to 2022. This metric is admittedly flawed, since it doesn't take into account road length or capacity or anything of the sort, but for exploration, seeing where the most crashes take place by sheer number is useful.
roadData = crashes['REFERENCE_ROAD_NAME'].value_counts()
roadData = pd.DataFrame({'road':roadData.index,'crashes':roadData.values})
roadData = roadData.head(100)
display(roadData)
display(roadData[roadData['road']=='BALTIMORE AVE'])
| road | crashes | |
|---|---|---|
| 0 | BALTO BELTWAY | 3111 |
| 1 | ANNAPOLIS RD | 2518 |
| 2 | CAPITAL BELTWAY | 2150 |
| 3 | CONNECTICUT AVE | 2130 |
| 4 | BALTIMORE WASHINGTON PKWY | 2116 |
| ... | ... | ... |
| 95 | MARTIN LUTHER KING JR BLVD | 531 |
| 96 | DULANEY VALLEY RD | 528 |
| 97 | NORTHWEST EXPRESSWAY | 528 |
| 98 | GOOD LUCK RD | 524 |
| 99 | WHITE MARSH BLVD | 517 |
100 rows × 2 columns
| road | crashes | |
|---|---|---|
| 9 | BALTIMORE AVE | 1815 |
To figure out if the number of crashes on any given road is significantly different from the average number of crashes on the 100 most crashed on roads, I did a Z-test, since the sample size is large. I found that Route One does not have significantly more crashes than any of the other top 100 roads. Only the top 6 most dangerous roads have crashes significantly different from the mean.
import scipy.stats
roadData['z-score'] = roadData['crashes'].apply(lambda x:(roadData['crashes'].mean()-x)/roadData['crashes'].std())
roadData['p-value'] = roadData['z-score'].apply(lambda x: scipy.stats.norm.sf(abs(x))*2)
roadData['Significant'] = roadData['p-value'].apply(lambda x: x<.05)
display(roadData.head(10))
| road | crashes | z-score | p-value | Significant | |
|---|---|---|---|---|---|
| 0 | BALTO BELTWAY | 3111 | -4.104533 | 0.000041 | True |
| 1 | ANNAPOLIS RD | 2518 | -2.938238 | 0.003301 | True |
| 2 | CAPITAL BELTWAY | 2150 | -2.214466 | 0.026797 | True |
| 3 | CONNECTICUT AVE | 2130 | -2.175131 | 0.029620 | True |
| 4 | BALTIMORE WASHINGTON PKWY | 2116 | -2.147596 | 0.031746 | True |
| 5 | PENNSYLVANIA AVE | 2115 | -2.145629 | 0.031903 | True |
| 6 | GEORGIA AVE | 1987 | -1.893882 | 0.058241 | False |
| 7 | LIBERTY RD | 1906 | -1.734574 | 0.082816 | False |
| 8 | CENTRAL AVE | 1823 | -1.571332 | 0.116106 | False |
| 9 | BALTIMORE AVE | 1815 | -1.555598 | 0.119804 | False |
Now that I have the 100 most dangerous roads in Marlyand by sheer number of crashes, I want to go back to the original dataset to figure out if I can learn about why these roads are so dangerous. To do this, I took my dataset and removed all entries that didn't take place on my list of the top 100 most dangerous roads.
roads = roadData['road'].tolist()
crashes = crashes[crashes['REFERENCE_ROAD_NAME'].isin(roads)]
display(crashes)
| YEAR | QUARTER | LIGHT_DESC | LIGHT_CODE | COUNTY_DESC | COUNTY_NO | MUNI_DESC | MUNI_CODE | JUNCTION_DESC | JUNCTION_CODE | ... | FEET_MILES_FLAG | DISTANCE_DIR_FLAG | REFERENCE_NO | REFERENCE_TYPE_CODE | REFERENCE_SUFFIX | REFERENCE_ROAD_NAME | LATITUDE | LONGITUDE | LOCATION | datetime | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2020 | Q2 | NaN | 6.02 | Baltimore City | 24.0 | NaN | NaN | Non Intersection | 1.0 | ... | M | N | NaN | NaN | NaN | NORTH AVE | 39.311025 | -76.616429 | POINT (-76.616429453205 39.311024794431) | 2020-04-30 06:39:00 |
| 17 | 2021 | Q4 | Daylight | 1.00 | Baltimore | 3.0 | NaN | NaN | Not Applicable | 0.0 | ... | F | N | NaN | NaN | NaN | EASTERN BLVD | 39.322488 | -76.456602 | POINT (-76.456602128426 39.32248773136) | 2021-12-01 16:00:00 |
| 28 | 2021 | Q4 | Daylight | 1.00 | Baltimore City | 24.0 | NaN | 999.0 | Non Intersection | 1.0 | ... | U | N | 40.0 | US | NaN | EDMONDSON AVE | 39.316242 | -76.600602 | POINT (-76.600602337322 39.31624205) | 2021-11-01 08:00:00 |
| 31 | 2021 | Q4 | Daylight | 1.00 | Baltimore | 3.0 | NaN | 0.0 | Intersection | 2.0 | ... | F | N | 1.0 | US | NaN | BELAIR RD | 39.383422 | -76.499001 | POINT (-76.499001276878 39.383421935655) | 2021-10-02 15:19:00 |
| 100 | 2021 | Q3 | Daylight | 1.00 | Baltimore City | 24.0 | NaN | 999.0 | Intersection | 2.0 | ... | F | N | 648.0 | MD | E | ANNAPOLIS RD | 39.283102 | -76.646653 | POINT (-76.646652592591 39.283101915293) | 2021-08-07 07:42:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 848755 | 2022 | Q3 | NaN | 5.02 | Prince George's | 16.0 | NaN | 0.0 | Non Intersection | 1.0 | ... | F | S | 5.0 | MD | NaN | BRANCH AVE | 38.840298 | -76.873562 | POINT (-76.873561666667 38.840298333333) | 2022-07-09 05:14:00 |
| 848756 | 2022 | Q3 | Dark No Lights | 4.00 | Baltimore | 3.0 | NaN | 0.0 | Non Intersection | 1.0 | ... | M | S | 122.0 | MD | NaN | SECURITY BLVD | 39.375950 | -76.722297 | POINT (-76.722296666667 39.37595) | 2022-09-23 18:31:00 |
| 848765 | 2022 | Q3 | Daylight | 1.00 | Baltimore | 3.0 | NaN | 0.0 | Intersection | 2.0 | ... | F | N | 144.0 | MD | NaN | FREDERICK RD | 39.244980 | -76.663258 | POINT (-76.66325815 39.244980016667) | 2022-07-27 12:55:00 |
| 848774 | 2022 | Q3 | Daylight | 1.00 | Carroll | 6.0 | NaN | 0.0 | Intersection | 2.0 | ... | F | N | 26.0 | MD | NaN | LIBERTY RD | 39.457587 | -77.087513 | POINT (-77.087512866667 39.457587416667) | 2022-07-21 16:45:00 |
| 848782 | 2022 | Q3 | Daylight | 1.00 | Prince George's | 16.0 | NaN | 0.0 | Non Intersection | 1.0 | ... | F | W | 295.0 | MD | NaN | BALTIMORE WASHINGTON PKWY | 38.918249 | -76.940636 | POINT (-76.940636 38.918249) | 2022-07-03 17:33:00 |
102406 rows × 56 columns
I mapped all of the crashes from the top ten most crashed on roads on Maryland in order to see if this could provide us any insight about where the most crashes occur. After limiting the crash data to only those belonging to the top ten roads, I assigned a unique color to each road to allow us to differentiate between them more easily.
map_osm = folium.Map(location=[39.0458, -76.6413], zoom_start=9)
roads = roadData['road'].head(10)
top10roads = crashes[crashes['REFERENCE_ROAD_NAME'].isin(roads)]
colors = {'ANNAPOLIS RD':'#E74C3C','LIBERTY RD':'#3498DB','BALTO BELTWAY':'#1ABC9C',
'BALTIMORE AVE':'#000000','CENTRAL AVE':'#F4D03F','CAPITAL BELTWAY':'#E67E22','CONNECTICUT AVE':'#F0F3F4',
'GEORGIA AVE':'#909497','BALTIMORE WASHINGTON PKWY':'#186A3B','PENNSYLVANIA AVE':'#FFFFFF'}
for index,crash in top10roads.iterrows():
folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]],fill=True,opacity=0.8,
color=colors[crash['REFERENCE_ROAD_NAME']]).add_to(map_osm)
map_osm